Loan Data Exploration by Daniel Cummings

This report explores a dataset provided by Prosper containing loan data for approximately 114,000 loans. One area of focus will be on how different variables such as loan amount, credit grade, and borrower monthly income relate to the APR (annual percentage rate) of a loan.

Univariate Plots Section

# DataFrame structure
str(df)
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

This data set contains 113,937 loans, with 81 features. Due to the large number of features, I’ll limit the scope of the data and focus on how specific borrower features (e.g. occupation, income, credit grade, etc.) impact the annual percentage rate (APR).

summary(df)
##  AvailableBankcardCredit  BorrowerAPR      CreditGrade   
##  Min.   :     0          Min.   : 0.653   C      : 5649  
##  1st Qu.:   880          1st Qu.:15.629   D      : 5153  
##  Median :  4100          Median :20.976   B      : 4389  
##  Mean   : 11210          Mean   :21.883   AA     : 3509  
##  3rd Qu.: 13180          3rd Qu.:28.381   HR     : 3508  
##  Max.   :646285          Max.   :51.229   (Other): 6604  
##  NA's   :7544            NA's   :25       NA's   :85125  
##  CurrentCreditLines CurrentDelinquencies CreditScoreAvg 
##  Min.   : 0.00      Min.   : 0.0000      Min.   :  9.5  
##  1st Qu.: 7.00      1st Qu.: 0.0000      1st Qu.:669.5  
##  Median :10.00      Median : 0.0000      Median :689.5  
##  Mean   :10.32      Mean   : 0.5921      Mean   :695.1  
##  3rd Qu.:13.00      3rd Qu.: 0.0000      3rd Qu.:729.5  
##  Max.   :59.00      Max.   :83.0000      Max.   :889.5  
##  NA's   :7604       NA's   :697          NA's   :591    
##       EmploymentStatus IsBorrowerHomeowner           ListingCategory 
##  Employed     :67322   False:56459         Debt Consolidation:58308  
##  Full-time    :26355   True :57478         Not Available     :16965  
##  Self-employed: 6134                       Other             :10494  
##  Not available: 5347                       Home Improvement  : 7433  
##  Other        : 3806                       Business          : 7189  
##               : 2255                       Auto              : 2572  
##  (Other)      : 2718                       (Other)           :10976  
##    LoanAmount                    LoanStatus    MonthlyLoanPayment
##  Min.   : 1000   Current              :56576   Min.   :   0.0    
##  1st Qu.: 4000   Completed            :38074   1st Qu.: 131.6    
##  Median : 6500   Chargedoff           :11992   Median : 217.7    
##  Mean   : 8337   Defaulted            : 5018   Mean   : 272.5    
##  3rd Qu.:12000   Past Due (1-15 days) :  806   3rd Qu.: 371.6    
##  Max.   :35000   Past Due (31-60 days):  363   Max.   :2251.5    
##                  (Other)              : 1108                     
##                     Occupation    ProsperRating   StatedMonthlyIncome
##  Other                   :28617   C      :18345   Min.   :      0    
##  Professional            :13628   B      :15581   1st Qu.:   3200    
##  Computer Programmer     : 4478   A      :14551   Median :   4667    
##  Executive               : 4311   D      :14274   Mean   :   5608    
##  Teacher                 : 3759   E      : 9795   3rd Qu.:   6825    
##  Administrative Assistant: 3688   (Other):12307   Max.   :1750003    
##  (Other)                 :55456   NA's   :29084                      
##  Term      
##  12: 1614  
##  36:87778  
##  60:24545  
##            
##            
##            
## 

The loan data shows that borrowers preferred to take loans in nicely rounded amounts (e.g. $10000, $15000, $20000). The Borrower APR shows a normal distribution centered around 21% with an anomalous mode around 36%. This behavior will be investigated further, but it is likely that the loan company sets specific APR tiers for borrowers in specific credit grade categories. The most common loan term was 36 months, followed by 60.

Comparing standard credit grade versus the Prosper rating, Prosper tended to give fewer borrowers a positive ‘AA’ rating. However, they also gave more ‘E’ instead of less desirable ‘HR’ ratings at the other end of the spectrum. The credit score is normally distributed with a median near 690. Interestingly, the histogram of all borrowers stated monthly income has a right-skewed long-tail distribution. On a log scale, the income data is centered around a median and mean of value of $4667 and $5608 respectively. Additionally, the available bankcard credit data follows a right-skewed long-tail distribution trend as well.

The most occupations stated by borrowers fall into the ‘Other’ and ‘Professional’ categories. Ignoring these two generic categories, the next top categories are computer programmer, executive, teacher, administrator and analyst. The least common borrower occupations in the dataset are in the categories of judge, dentist, and student.

Ignoring ‘not available’ and ‘other,’ the most common reasons for loans were debt consolidation, home improvement, business, and automotive.

Univariate Analysis

What is the structure of your dataset?

This data set contains 113,937 loans, with 81 features for each loan. I chose to filter these features down to 16 to focus how borrower attributes such as income, credit, and occupation interact with the loan amount and APR.

Some interesting insights are revealed by the initial data summary: - Median stated monthly income is $4667 - Approximately half of the borrowers are homeowners - Average monthly loan payment is $272 - Average loan amount is $8337 - Average amount of current credit lines for each borrower is 10.32 - Most common borrower credit grade is ‘C’ and average credit score is 695

What is/are the main feature(s) of interest in your dataset?

Typically, the key features of loans that a potential borrower evaluates are the term, monthly payment, and APR for a given loan amount. Another area of interest will be to look at the differences between the Credit Grade and the Prosper Ranking since they use the same ranking key (e.g. ‘C’, ‘B’, ‘A’, ‘AA’).

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Conventional wisdom would dictate that the more important features for getting a competitive loan are the borrower credit score and monthly income. Addition, I will investigate the impact of occupation, listing category, available credit, and more.

Did you create any new variables from existing variables in the dataset?

The credit score for each loan was provided as a lower and upper bound. For simplicity, I created a new variable ‘CreditScoreAvg’ to take the average of the bound.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The borrower APR has a normal distribution centered around 0.21%. However, there is an anomalous peaks in the histogram at 0.30% and 0.36%. This could implyProsper implemented a generic ‘common’ rate based on the Prosper Rating feature. The other interesting distribution in data was for the loan amount. The most frequent loan amounts occur at nicely rounded intervals such as $10000 and $15000. The CreditGrade and ProsperRatings factors were reorder to put the “AA” at a higher ranking than the “A” rating. The “ListingCategory” factors were renamed to directly state the category name instead of a number key. Term and ListingCategory features were converted from integer to factor types for ease of plotting.

Bivariate Plots Section

Credit delinquencies shows a weak negative correlation on the borrowers credit score with some interesting outliers. For example, a few borrowers with credit scores greater that 750 had more than 10 current delinquencies). Borrowers with low credit (<600) scores tended to have lower monthly payments (and loan amounts) than those with higher credit scores (>600). Credit lines did not show much coorlation with credit score; the average was about 10 credit lines across the range. Available bankcard credit showed a clear positive correlation with credit score.

The credit grade and prosper rating differed significanly with respect to credit score. This makes sense since classically credit grade should be a function of credit score whereas Prosper likely has proprietary weighting system with multiple input in play.

Credit score shows a weak to moderate negative correlation to APR. In other words, the higher the credit score, the lower the APR. Delinquencies in credit had a very week correlation to APR. Looking at APR in terms of credit grade reveals the intresting histogram results shown earlier for univariate APR. There seems to be a pattern where for each credit grade, there are several standard rates for APR. Most interestingly, the Prosper rating lines up in very clear stepping order with respect to APR. Essentially, one could reasonably estimate APR soley based on the Prosper rating.

The three occupations with the highest median APR are teacher’s aide, nurse’s aide and bus driver. For the lowest median APR, the occupations are Judge, Doctor, and Professor. Monthly income, past $10k, didn’t have a huge factor on APR. Each loan term had a similar mean/median value in relation to the APR.

The median loan amounts for the baby/adoption and debt consolidation categories were higher than the other categories. For less than a monthly income of $10k, there was a weak correlation between monthly income and loan amount. Borrowers with higher credit scores (greater than 650) tended to take larger loans.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The initial set of plots explore credit score versus various data set features. As one would expect, high deliquencies and lack of bankcard credit had a negative effect on the credit score. Interestingly, the number of credit lines open did not have a significant correlation with credit score. Credit grade had a direct relationship credit score but not so much with the Prosper grade. The next set of data set explorations focused on the APR. Although it was clear that credit grade had a relationship to APR, the Prosper grade was very clearly tiered with the APR. In other words, one could reasonably estimate APR soley based on the Prosper rating. There were no surprises in the listing category, The median loan amounts for the baby/adoption and debt consolidation categories were higher than the other categories.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Surprisingly, each loan term had a similar distribution and median relationship versus APR. I would expect that longer term loan come with a higher APR as they do in the home mortgage industry. The highest APR in terms of listing category was cosmetic procedure. I expected a category like automotive or medical to come in the top due to the often urgent nature of such expenses.

What was the strongest relationship you found?

The strongest relationship I found comparing continuous data was with credit score and APR. From a categorical perspective, the strongest relationship was between the Prosper rating and the APR.

Multivariate Plots Section

The most interesting relationships in the prior section are related to the APR, monthly income, and Prosper ratings. However, before diving into the detail on these feature, I want to look at the classic term versus loan amount relationship.

Comparing the loan amount, monthly income and term shows a very clear trend where larger loans are ofter associated with longer loan terms.

In the bivariate plots section, it was shown that a very clear relationship existed between the Prosper rating and the APR. Interestingly, monthly income does not play a large role in the Prosper/APR relationship.

Since the Prosper rating has such a strong relationship with APR, I chose to facet by the Prosper rating and look a additional features. The larger loans are more common for borrowers with better Prosper ratings. Likewise, the higher bank card credit, the better the APR and Prosper rating.

The median credit score for this data set was 690. Higher credit scores almost always are associated with lower APRs and better Prosper ratings.

The home ownership status of borrowers had lesser impact on which Prosper rating and APR they were assigned. Likewise, loan term length seemed similarly distributed across Prosper ratings with the exception of the “HR” rating.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The APR/Prosper rating feature were the main are of interest in this section. Higher credit score, higher bankcard credit and higher monthly income all provided similar beneficial relationships with the borrowers Prosper ranking and APR.

Were there any interesting or surprising interactions between features?

Of interest was the insight that home ownership status had little impact on the Prosper ranking/APR. It is likely that Prosper cares more about the credit score (and bankcard credit, income) of the borrower as opposed to the home ownership status.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Getting a loan with a low APR is what most borrowers look for. Once a borrower is assigned a Prosper rating, it is pretty clear what APR range could be expected.

Plot Two

Description Two

If a borrower wants to take a large loan with a low APR, it helps to have a high monthly income and a good Prosper rating. Not surprisingly, a strong income can help offset a lukewarm Proser rating like “C” in getting a large loan. However, there will be an APR penalty for this case.

Plot Three

Description Three

Credit score matters! There is a reason why so much emphasis is placed on credit score with regards to loans. A good credit score is major factor in getting a good APR as shown above.


Reflection

The Prosper loan data was particularly enjoyable to explore since it did not focus on the usual large home mortgage style loans. Personal loan data lends some intersting insights from the occupation and listing category perspectives. One area of work that was time consuming was determining what plot layouts best described the data. It took a while to arrive at the faceting of the Prosper grade for the multi-variable analysis, but this precipitated an avalanche of more explorations on my part where the end result was gratifying. For future work I would like to include additional insight for featrues as a function of time such as loan origination date and quarter.